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Mixture models have received considerable attention recently and 
Newton [Sankhyd Ser. A 64 (2002) 306-322] proposed a fast recur- 
sive algorithm for estimating a mixing distribution. We prove almost 
sure consistency of this recursive estimate in the weak topology under 
mild conditions on the family of densities being mixed. This recursive 
estimate depends on the data ordering and a permutation-invariant 
modification is proposed, which is an average of the original over 
permutations of the data sequence. A Rao-Blackwell argument is 
used to prove consistency in probability of this alternative estimate. 
Several simulations are presented, comparing the finite-sample per- 
formance of the recursive estimate and a Monte Carlo approximation 
to the permutation-invariant alternative along with that of the non- 
parametric maximum likelihood estimate and a nonparametric Bayes 
estimate. 

1. Introduction. Mixture distributions have played a key role in model- 
ing data that reflect population heterogeneity, contain indirect observations 
or involve latent variables. In recent years, these models have been widely 
used in genetics, bioinformatics, proteomics, computer vision, speech analy- 
sis and a host of other research areas; see, for example, [1, 5, 7, 21, 25, 27, 31]. 
Fitting a mixture model has been made easy by the advent of computational 
techniques such as the Expectation Maximization (EM) and the Markov 
Chain Monte Carlo (MCMC) algorithms. Recovering the underlying mixing 
distribution, however, continues to pose a serious challenge. 

Newton, et al. [22, 23, 24] introduced a fast, recursive algorithm for es- 
timating a mixing density when a finite sample is available from the corre- 
sponding mixture model. Suppose Xi , . . . , Xn are independently distributed 
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(iid) according to the density 

(1) p{x)= f p{x\e)F{de), 

Je 

where p{x\9) is a known samphng density, with respect to a dominating a- 
finite measure i' on X, parametrized by € 0. Assume also that the mixing 
distribution F is absolutely continuous with respect to some cj-finite measure 
fi on 0. Newton [22] proposed to estimate / = dF/dfi as follows. 

Recursive algorithm. Fix an initial estimate /o and a sequence of 
weights wi,W2, ■ ■ ■ £ (0,1). Given i.i.d. observations Xi,. . . , Xn from the mix- 
ture density p{x) in (1), compute 

(2) , + 

for i = 1, . . . ,n and produce fn as the final estimate. 

This method of estimating / has a number of advantages over the exist- 
ing mainstream methods found in the literature. To begin with, it is rather 
straightforward to accommodate prior information regarding support and 
continuity properties of / through those of /q. For example, if p{xi\9) > 
for all G 0, then supp(/j) = supp(/i_i). Therefore, by choosing /o appro- 
priately one can ensure that has the same support as the target /. This 
flexibility is not offered by the method of nonparametric maximum likeli- 
hood (NPML) estimation of F; cf. Laird [16] and Lindsay [18] which pro- 
duces an estimate supported on at most n points. It is also evident that the 
recursive algorithm above applies to any arbitrary sampling density p{x\9), 
making this method more general than deconvolution methods which deal 
exclusively with sampling densities of the type p{x\6) = ip{x — 9) for some 
density ip. However, a lot is known about deconvolution methods; see, for 
example [9, 30, 35]. 

The flexibility associated with /„ resembles those found in a Bayesian 
framework. Indeed, for n = 1, the estimate fn is precisely the posterior mean 
of / under the Bayesian formulation that a priori / follows a Dirichlet pro- 
cess (DP) distribution [8, 10] with base measure /o and precision 1/wi — 1. 
Newton's original motivation for the recursive algorithm was based on this 
fact [23], though this analogy breaks down for 77, > 1. In particular, /„, for 
n > 1, depends on the particular order in which XiS enter the recursion and 
hence is not a function of the sufficient statistic (^(i), • ■ • iXi^^-^) — the order 
statistics. Consequently, /„ cannot equal any posterior quantity. To further 
distinguish the two estimates, computation of a nonparametric Bayes esti- 
mate /npb based on the DP prior requires a rather elaborate Monte Carlo 
procedure, while the recursive estimate can be computed many times faster. 
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It was hoped that /„ would serve as a computationally efficient approxima- 
tion to /npb- 

It is rather difficult to study the asymptotic properties of /„ since it 
is not a Bayesian quantity, does not seem to optimize a criterion function 
such as the log likelihood and cannot be written as a linear estimator [13]. 
Nevertheless, empirical studies carried out by Newton [22] and Ghosh and 
Tokdar [12] clearly demonstrated good performance of this estimate. New- 
ton [22] also presented a proof of convergence of /„ as n ^ oo based on the 
theory of inhomogeneous Markov chains. Unfortunately, this proof had a 
gap [12]. Ghosh and Tokdar [12] used a novel martingale based argument to 
show consistency under the same conditions as in Newton [22]. A slightly 
stronger result has recently been derived by Martin and Ghosh [20] using 
a stochastic approximation representation of the algorithm. The conditions 
required in these papers are somewhat restrictive, particularly in requiring 
to be a known finite set. Ghosh and Tokdar [12] further require p{x\6) to 
be bounded away from zero on ^ x 0, while Martin and Ghosh [20] make 
the weaker assumption that p{-\9) > i/-a.e. for each 9. 

In this paper we show consistency of fn in the weak topology under quite 
general conditions. Our assumptions are slightly stronger than those needed 
in Kiefer and Wolfowitz [14] or Leroux [17] to prove consistency of the NPML 
estimate of F. In particular, Q is not required to be finite and the sampling 
density is allowed to decay to zero. We use a major extension of the basic 
martingale argument in [12] to show that the marginal density 



almost surely converges to p{x) in the Li topology. This then leads to almost 
sure weak convergence of /„ to /. This latter result applies to any arbitrary 
method of estimation — for example, it can be used to show weak consistency 
of the posterior mean of the mixing distribution under the DP formulation. 
This result holds even when @ is noncompact, as long as p{x\9) van- 
ishes at the boundary in a certain near-uniform sense. The main martingale 
argument, too, does not explicitly require much structural assumption on 
0, but assumption A5 (see Section 2) would be difficult to verify without 
compactness. 

Despite this asymptotic justification, the dependence of fn on the order 
of the observations could be a cause of concern in application, especially 
when n is not very large. In some cases a particular ordering can be justified 
by problem specific considerations. For example, "sparseness" assumptions 
(i.e., that only a small percentage of the observations come from the nonnull 
component of the mixture) led Bogdan, et al. [3] to arrange the observations 
in the ascending order of their magnitude while estimating a mixing density 
underlying a multiple testing problem. In the absence of such justification 
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a permutation invariant version of may be desirable. Newton [22] recom- 
mends calculating the average over a large number of random permutations 
which can be seen as a Monte Carlo approximation to 

{^) In— ^ ^ ^ fn,si 

where Sn is the permutation group on {1, . . . ,n} and fn,s, for s S Sn, repre- 
sents the estimate /„ with the observations arranged as . . . ,Xg(^n)- 

In Section 3 we show that fn provides a Rao-BIackwellization of /„ and 
satisfies E(i(/, fn) < /„) for many standard divergence measures d. 

This property is then exploited to show that in the weak topology, f 
in probability. Section 4 presents a simulation study of finite sample perfor- 
mance of /„ and /„ — a Monte Carlo approximation to fn ■ It is demonstrated 
that fn, which requires more computing time than fn, is still faster and more 
accurate than other existing methods, such as the NPML estimate or a NP 
Bayes estimate. Finally, in Section 5 we give some concluding remarks. 

2. Consistency of Newton's estimate. For the remainder of the paper we 
consider the following assumptions: 

Al. E^=l Wi = oo and E«=i wj < oo. 

A2. The map F ^ f p{x\9)F{d9) is injective; that is, the mixing distribution 

F is identifiable from the mixture / p[x\9)F[d9). 
A3. For each x ^ X, the map 9 '^p{x\9) is bounded and continuous. 
A4. For any e > and any compact Xq C X, there exists a compact Gq C 

such that f^^p{x\9)v{dx) < e for all 9 ^ Qq. 
A5. There exists a constant B < oo such that, for every ^i, 6*2, 6*3 G 6 

The first condition on the weights Wi is necessary for /„ to outgrow the 
influence of the initial guess /q. At the same time, the weights need to decay 
to zero to allow for accumulation of information. The square summability 
condition ensures a certain rate for this decay, suitable for the Taylor approx- 
imation approach taken here. The identifiability condition A2, necessary for 
any estimation of mixture densities, is shown in Teicher [33] to be satisfied 
by many sampling densities of interest; for example: 

• Normal with mean 9 and fixed variance > 0, 

• Gamma with rate 9 and fixed shape a > 0, 

• Poisson with mean 9. 

Each of these densities satisfy the boundedness conditions A3 as well as the 
decay property A4. These also satisfy the square integrability condition A5 
when is a compact interval. 
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Let Kn = J f log{f / fn) dfi and K* = J plog{p/pn) du denote the error 
measures according to the Kullback-Leibler (KL) divergence, where fn and 
Pn are defined in (2) and (3), respectively. On appHcation of a telescoping 
sum, it follows easily from the recursive definition of the estimates fi that 



n „ 
1=1 



l + w. 



1 



Write log(l + x) = x — x'^R{x), x > —1, where the remainder term satisfies 
< R{x) < max(l, {l+x)~^)/2. Then 



Kn-Ko = Y^ 



(5) 



where 



i=l 



Wii 1 



P{X^ 



Pi^i{Xi) 

% n 

Y^WiVi-Y^WiMi + Y^E, 



+ / R,{x,,e)f{e)^l{de) 



e 



1=1 



1=1 



i=l 



R^{x,e)=w^i 
Mi = -E 



f p{x\6) 



\Pi-i{x) 

p{X, 



X 



Pi^i{Xi) 
p{x) 



V^ = l 



Pi~i{x) 
p{Xi) 



1 1 R\wi 

- l^p{x)v{dx), 
+ Mi, 



f p{x\e) 



\Pi-i{x) 



Ei 



e 



Pi-i{X., 

R,{x,,9)f{e)ii{de). 



In the following we prove that each of Yl'^iWiVi, J2i^i^i J2i^iWiMi 
is finite with probability 1. 

Let ^i = a{Xi, . . . ,Xi) be the cr-algebra generated by the first i obser- 
vations. By definition of Vi, Sn = J27=iWiVi forms a zero mean martingale 
sequence with respect to Moreover by assumption A5, 



' n 



.i=l 



p{X,) 
Pi-i{Xi 



<2{l + B)Ywl 



i=l 
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Since X^i^i^i < oo by Al, ]E(5'^) is uniformly bounded in n. Therefore, by 
the martingale convergence theorem [6], 5„ almost surely converges to a 
random variable Soo with E(S'oo) < oo. 

Let Tn = X)r=i ^^'^ = liniT^, which always exists (since Sj's are 

nonnegative) but may equal infinity. Notice that for n > and v S (0, 1), 

(u - 1)2 max{l, (1 + v{u - 1))"^} < max{(n - l)^ {1/u - 1)^}. 

Therefore, 

wf ( p{x\6) \2 r / ( p{x\(^) 



2 IV p,-i{x)) 'V p{x\e) 

and hence, by assumption A5, 

¥.[Ei\^i.i] = [ E[R^{Xi,9)\^i.i]fi9)fi{d0)<wf{l + B). 
Je 

By Fatou's lemma and assumption Al, 



u=i 



lE(Too) < limjnf E(r„) = lim^inf E( ^E[^i|^i_i] 

n 

< liminf(l + B)^wf < oo. 



1=1 



which proves T^o is finite with probability 1. 

Now rearrange the terms in (5) and use nonnegativity of Kn to get 

n 

(6) ^u;,M,<Ko + 5„ + T;. 

i=l 

It follows from the inequality logy < y — 1 that Mj > K* > 0. Therefore 
WiMi exists but could be infinite. However, equation (6) implies 

oo 

(7) ^w;iMi<i^o + 5oo + Too < oo a.s. 

The almost sure finiteness of the three series in (5) leads to the following 
important result. 



Theorem 1. Under Al and A5, Kn — > K^q a.s. for some random vari- 
able KoQ. Moreover, K^^O a.s. on a (random) subsequence. 
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Proof. The first assertion is a simple consequence of tlie finiteness of 
the three series. The second observation follows since X^i^i '^i^i < almost 
surely while X^i^i = □ 

Next, define the quantities 

5.,x(e) = ^^^^^4^ and K^,,{x)= f p{x\e)g,y{9)fi{de), 
so that the recursive updates i-^ fi and pi^i ^ Pi are, respectively, 

h{e) = {i-wi)fi.i{e) + wigi,xM, 

Pi{x) = (1 - Wi)pi_i{x) + Wihi^x,{x)- 
Therefore, as in the case of Kn^ we could write 

K:-Ko* = -E / p(x)log[l + «;/^^^^-l)lKdx) 
±1 p(x)Lfl-^^^)+ii*(X„x) 



i=l ■ 
n 



u{dx) 



i=l 



i=l 



where 



r>*( I \ 2 1 hi x'yXj 



1 Riw 



E* 



M* 



X 



-E 



Vi-\{x) 
R*{Xi,x)p{x)iy{dx), 



Pi-i{x) 



1 



X Pi-i{x) 
hi,x'ix) 



p{x)v{dx] 



V* = 1 



xJx Pi~i{x) 



X Pi-l(x) 



p{x)p{x')v(dx)v(dx') — 1, 
p{x)i'{dx) + M* . 



Proceeding as in the lead up to Theorem 1, it can be shown that each of 
X^i^iWjV^j*, Yli^i'^i^i and Yli^i^i is finite almost surely. The required 
nonnegativity of M* is established using Jensen's inequality as follows: 

JxJx Pi-i{x)Pi-l[x') 
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/e [Jx Pi-i{x) 

>{ f f ^^MxMdx)f,.i{eMde)y -1 

[Je Jx Pi~i[x) ) 
= 0. 

From this we conclude the fohowing. 

Theorem 2. Under Al and A5, K*^0 a.s. 

Proof. It fohows from the above discussion that K* — > K^, almost 
surely, for some random variable K^. But we know from Theorem 1 that 
K* — a.s. on a subsequence. These two together imply = a.s. □ 

It follows from the above theorem that Pn ^ P in the Li topology (also 
in the topology of the Hellinger metric). We next show that under identifi- 
ability, Li consistency of pn implies weak consistency of fn via a tightness 
argument. Since this result requires no assumption on the construction of 
fn, it is presented in the next theorem in more generality than required at 
the moment. We will see some other use of it in the sequel. 

Theorem 3. Let F and Fn he probability measures on Q^with respective 
mixture densities p{x) = / p{x\6)F{d9) and pn{x) = J p{x\6) Fn{d9) . Suppose 
Pn^P in Li. Then, under A2-A4, Fn ^ F in the weak topology. 

Proof. We first show that Fn forms a tight sequence^Fix any e > 0. It 
suffices to show existence of a compact ©o C such that Fn{@o) > 1 — e for 
sufficiently large n. Take any compact C X such that J^^^pdi' > 1 — e/2. 
By A4, there exists a compact 0o such that J^^p{x\6)i'{dx) < e/2 for all 
9 ^Qq. Now apply the Li convergence of p.„ to p to conclude 

1-J</ pdi^= \im f pndi^<lmimi\Fnieo) + ^Fnie'o) 

Thus, Fn is tight and the final^assertion will follow once we show j3very 
weakly convergent subsequence -F„(fc) converges to F. Now, if -F„(fc) — > F* for 

some F* e ^{@) then, by assumption A3, Pn{k) P* pointwise and hence 
in the Li topology (via Scheffe's theorem), where p*{x) = J p{x\9)F* (dO). 
Therefore p* = p, which, under A2, implies F* = F. □ 

The following result precisely states what we have already proved regard- 
ing consistency of fn- 

Corollary 4. Under A1-A5, the estimate fn obtained from (2) con- 
verges almost surely to f in the weak topology. 
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3. Averaging over permutations. It is easy to see that the permutation 
averaged estimate /„ can be written as /„ = IE[/„ | -'^(i), • • • ,-^(n)]- Let p„ 
denote the corresponding mixture density 

p^{x) = I p{x\e)Ue)ii{de). 

Then p„ also satisfies p„ = E[p„|X(i), . . . Therefore /„ and pro- 

duce a Rao-Blackwellization of /„ and pn, respectively, by making these 
functions of the sufficient statistic — the order statistics. As one might guess, 
this results in a smaller expected error in estimation, when error is measured 
by a divergence d that is convex in the estimate 

Ed(/, fn) = Ed(/,E[/„|X(i), . . . , 

<E{E[d(/,/„)|X(i),...,X(„)]} 
= Erf(/,/n) 

and similarly, Ed{p,p^) < E,d{p,pn). Examples of such divergence measures 
d include the KL divergence and the Li distance. 

We next show that the above result leads to weak convergence of fn 
to /. However, we prove convergence only in probability and not almost 
surely. Recall that 1^ — > y in probability if and only if every subsequence 
rik contains a further subsequence n^f^i) such that i^nj.;;) Y a.s., whenever 
the underlying topology is metrizable. 

Theorem 5. Under A1-A5, fn converges weakly to f in probability. 

Proof. From Theorem 2 it follows that — Pnlli a.s. Since the Li 
distance is bounded by 2, it follows by the dominated convergence theorem 
that K\\p — PnWi 0. Rao-Blackwellization implies E||p — —>■ and, 
hence, \\p — — > in probability. Take an arbitrary subsequence n^. It 
must contain a further subsequence nfc(j) such that 111 ~* a-s- Then 

Theorem 3 implies that /nj.;,) f a.s. in the weak topology. The assertion 
follows since the weak topology is metrizable. □ 

Remark 6. Even for moderate n, there are too many permutations to 
compute Pn exactly, so the Monte Carlo estimate pn is used as a numerical 
approximation. Therefore, what we can conclude from Theorem 5 is a sort of 
practical consistency of pn', that is, for large n and sufficiently many random 
permutations, Pn ~Pn and Pn which implies Pn ~P- 
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4. Simulations. The numerical results in [12, 22] show that /„ performs 
well in a variety of problems. In the following subsections we compare, more 
extensively, the performance of the recursive estimate (RE) and the recursive 
estimate averaged over permutations (PARE), starting with initial guess 
/o, with that of several popular competitors, namely, the nonpar ametric 
maximum likelihood estimate (MLE) and the nonparametric Bayes estimate 
(NPB) based on a Dirichlet process prior with base measure /o and precision 
constant set to 1. While RE and PARE are easy to compute, computation 
of MLE and NPB is nontrivial. For the MLE, we implement an efficient new 
algorithm of Wang [34]. To find NPB, we employ a new importance sampling 
method, based on a collapsing of the Polya Urn scheme; see the Appendix. 
We set the following simulation parameters: 

• T = 100 samples of size n = 200 are taken from the model. 

• For PARE, 100 random permutations of the data are selected. 

• For RE and PARE, the weights satisfy Wi = {i + 

• For NPB, R = 10,000 importance samples are used; see (12). 

The efficiency of our NPB algorithm is measured by the effective sample size 
(ESS) [19]. For an importance sample of size R the ESS, given by 

R 

ESS = p — 

1 + varjwj, . . ■■,uj*p^\ 

estimates the size of an "equivalent" i.i.d. sample from the posterior distri- 
bution of /, where oj* is a normalization of the weight u;,. in (11). 

Remark 7. Estimation of a mixing distribution in the Dirichlet process 
mixture (DPM) formulation is an extremely difficult problem. The current 
Monte Carlo approaches for DPM models, including the one proposed in the 
Appendix, are based on some sort of exploration of the space of clustering 
configurations of the observations. Unfortunately, the conditional expecta- 
tion of the mixing distribution, given the clustering, is highly variable; much 
more so than the conditional expectation of the mixture density. Conse- 
quently, one needs an thorough exploration of the clustering space to obtain 
a reliable estimate of the mixing distribution. This is nearly impossible to 
achieve in finite time as this space grows exponentially with the number of 
observations. 

4.1. Regular mixtures. In this subsection we will consider two regular 
mixture models — regular in the sense that / is a density with respect to 
Lebesgue measure and smooth on its interval of support — namely, the Beta- 
Normal (BN) and the Gamma-Poisson (CP) mixtures 

(BN) ^i~iBeta(3,30) + |Beta(4,4), X^l^^ ~ iV(ei, a^). 
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(GP) Oi ~ TruncGamma(2,0.4), Xi\ei ~ Poisson(6'i). 

In each case, the samples are independent across i = 1, . . . , n. Here the usual 
Gamma(2, 0.4) distribution is truncated to G = [0,50]. One can easily check 
that conditions A2-A5 are verified for these models; in particular, A5 follows 
immediately from the compactness of 0. For (BN) we choose a = 0.1 but 
our conclusions hold for a range of a containing 0.1. We also choose /o to 
be a Unif(0) density in each case. 

Figures 1 and 2 display the estimates for model (BN) and (GP), respec- 
tively. In each figure, the upper left-hand cell shows the four estimates for 
a randomly selected run, while the other cells show the corresponding 100 
estimates. The traditional estimates — MLE and NPB — of / are quite poor, 
with the MLE being discrete and the NPB being very spiky; see Remark 7. 
Note that the average ESS over the 100 datasets for the (BN) and (GP) 
models are 538 and 324, respectively. On the other hand, RE and PARE 
are much more stable across samples. Moreover, as expected from the Rao- 
Blackwellization, we see less variability in the PARE than in the RE, in the 
sense that /„ hugs the true / closer than does /„. 

That the sampling distribution is discrete in model (GP) has an interest- 
ing implication. In Figure 2 there is a (false) peak at zero for the mixture 
density pn ■ This is due to the fact that the data Xi,. . . , Xn were generated 
by replicating each value according to its count. That is, the data sequence 
consists of all the Os first, followed by all the Is, etc. Therefore, permutation 
is necessary for count data stored in a particular deterministic order. 

Table 1 displays the mean computation time (in seconds). In each case, 
the computation time for PARE is significantly less than that of NPB. In the 
Beta-Normal example, PARE is more than 10 times faster than the MLE, 
but the latter is only slightly more efficient in the Gamma-Poisson example. 
One explanation for this discrepancy is that PARE must process each Xi 
individually, whereas the MLE allows for a reduction to the frequency table 
rix = :Xi = x}, which can result in a significant decrease in computation 
time, especially when the sample size n is large. 

Figure 3 summarizes the Li distances Li{p,p) (left) as well as what we call 
a bias-spread summary (right) for the 100 estimates in the two regular ex- 
amples. This bias-spread summary is similar to the traditional bias- variance 



Table 1 

Mean computation time {m seconds) for PARE, MLE and NPB over the T = 100 
samples. RE (not displayed) is about 100 times faster than PARE 



Model 


PARE 


MLE 


NPB 


BN 


0.14 


1.11 


43.77 


GP 


0.12 


0.20 


31.41 
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Fig. 1. Plots of the mixing density estimates (top row) and corresponding mixture density 
estimates (bottom row) for model (BN). The upper left-hand cell shows all four mixing 
density estimates for a randomly selected run, while the remaining cells show the true f 
or p (black) with the T = 100 estimates (gray). 



decomposition of mean-square error: if pnt is an estimate of p based on the 
tth sample = 1, . . . , T) of size n, then 

(8) Bias= / \pn--p\du and Spread = ^ / \pnt-Pn-\du, 

where Pn-{x) = T~'^ Ylt=i'Pnt{x) is the point-wise average of the T estimates 
of p{x). We consider the sum of the bias and spread as a measure of overah 
variabihty and look at how the two components contribute to the sum. 
In both examples, PARE performs better in terms of overall variability, 
spread and, most importantly, Li loss. Compared to the other estimates, 
it appears that PARE does a better job of simultaneously controlling bias 
and spread. In the Beta-Normal example, RE also performs well. Due to 
the deterministic ordering issue mentioned above, RE performs quite poorly 
for (GP) and is not displayed. Note that these relative comparisons remain 
the same when the L\ distance is replaced by the KL divergence. 




Fig. 2. Plots of the mixing density estimates (top row) and corresponding mixture density 
estimates (bottom row) for model (GP). The upper left-hand cell shows all four mixing 
density estimates for a randomly selected run, while the remaining cells show the true f 
or p (black) with the T = 100 estimates (gray). 



4.2. Irregular mixture. For an irregular mixture, we take / to have both 
a discrete and an absolutely continuous component. In particular, consider 
the Irregular-Normal (IN) mixture 

(IN) 9i ~ |5|o} + |TVuncNormal(0,4), XilOi ~ iV(0i, 1), 

where the samples are independent across i = 1, . . . , n, S^^y denotes a point- 
mass at zero, and the usual A^(0, 4) distribution is truncated to = [—10, 10] . 
Note that the choice of dominating measure ^ is Lebesgue measure on 
plus a unit mass at zero. The initial guess/hyperparameter /o is taken to be 
5^{o} + |Unif (0) density. In this subsection we focus on just the PARE and 
NPB estimates, the top two performers in Section 4.1. 

Figure 4 shows the 100 estimates of the the absolutely continuous part 
/ac of the mixing distribution as well as the corresponding estimates of 
the mixture. Just as in Section 4.1, we see PARE has considerably less 
variability than NPB (with an average ESS of about 330) on the 0-scale, 
while both perform comparably on the x-scale. The left-most plot in Figure 
5 summarizes the 100 estimates tt of vr = Pr(0 = 0). Both procedures tend 
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MLE 
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PARE MLE NPB 



Fig. 3. Summary of the Li distance Li{p,p) (left column) and Bias-Spread tradeoff 
(right column) for models (BN) (top row) and (GP) (bottom row). 



to overestimate vr = 0.667 (horizontal line). Most likely, this is because /ac 
is also fairly concentrated around = 0. The right two plots in Figure 5 
summarize Li{p,p) and the bias-spread over the 100 samples. PARE, again, 
tends to be much more accurate under Li loss: on average, Li(p,pnpb) is 
about 34% larger than Li{p,pn). Also, PARE seems to handle the twin 
bias-spread problems better than NPB. 



4.3. Massive data example. The irregular mixture (IN) in Section 4.2 
arises in many important applications. In microarray analysis [29] or quan- 
titative trait loci (QTL) mapping [3], each 6 represents the expression level of 
a single gene or the association level of a single genetic marker, respectively. 
For the nonparametric regression problem [4], the 0's represent coefficients 
of, say, a wavelet basis expansion of the regression function. In each example, 
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Fig. 4. Plots of the absolutely continuous parts of the mixing distributions (top row) and 
corresponding mixture density estimates (bottom row) for model (IN). The true /ac or p 
are shown in black with the T = 100 estimates in gray. 

the ^-vector is assumed to be sparse in the sense that most of the ^'s are 
zero. To account for sparseness, a Bayesian formulation assumes that the 0's 
are independent observations from a common prior distribution 

(9) F{d9) = ^5{o} (dO) + (1 - ^)/ac(e) d9. 

A fuhy Bayesian analysis can be difficult in these applications: the results are 
very sensitive to the choice of hyperparameters (vr, /ac) [4, 29]. However, the 
dimension n of the 0- vector is quite large so an empirical Bayes approach [28] 
is a popular alternative. It was shown in Section 4.2 that both the PARE 
and NPB can be used to estimate (vr,/ac), but when n is extremely large, 
computation becomes much more expensive, particularly for NPB. 

We take a simulated dataset of size n = 50,000 from the model (IN) in 
Section 4.2. Figure 6 shows the PARE and NPB estimates of (vr,/ac) in (9). 
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IN:ft = Pi<9 = 0) IN:L,(p,p) IN: Bias-Spread 




□ Spread 
■ Bias 




Fig. 5. Summary of the estimates of tt — Pt{9 = 0) (left), summary of the L\ distance 
Li{p,p) (middle) and Bias-Spread tradeoff (right) for model (IN). 



Although the PARE has two modes, it is a much closer approximation to the 
true /ac compared to the spiky NPB estimate. An important point is that, 
even with 10,000 importance samples, the ESS is only 1 ; see Remark 7. 
The estimates vr are also displayed and ttpare = 0.733 and ttnpb = 0.772 
are both slightly larger than the target tt = 0.667. Figure 6 also shows the 
estimates p of the marginal density p. With n = 50,000, Li-consistency of 
Pnpb [2, 11] and pn has kicked in, and that of NPB and PARE follows by 
Remark 6. What is perhaps most important in massive data problems — 
where almost any estimate will perform well — is computational efficiency. 
Here, the PARE was obtained in 4^5 seconds, while the NPB estimate took 
nearly 6 hours. 

Evidence suggests that PARE is a much better procedure than NPB in 
this problem for an empirical Bayes analysis. Compared to NPB, the PARE 
algorithm is easier to implement, the computation is significantly faster, and 
the resulting estimates of (vr, /ac) are much more accurate. 

5. Discussion. The previous analyses in [12, 20] fell short of proving 
strong consistency of recursive estimate in the general case, each only 
establishing convergence for the case of known and finite G. Here a more 
general theorem is proved by extending the martingale approach taken by 
Ghosh and Tokdar [12], namely, by extending the approximate martingale 
representation of K{f,fn) on the ©-space to K{p,pn) on the A'-space. That 
the KL is the appropriate divergence measure to use for our purposes is 
not immediately clear, but the stochastic approximation representation of 
Newton's algorithm for known finite along with the Lyapunov function 
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properties of the KL divergence shown in Martin and Ghosh [20] , show that 
the KL divergence is, indeed, quite natural. This stochastic approximation 
representation of the recursive algorithm in [20] continues to hold for more 
general @ and we speculate that an alternative proof of convergence can be 
given based on this fact. Unfortunately, definitive, ready-to-use results on 
convergence of stochastic approximation algorithms in such general spaces 
are, to our knowledge, not yet available. 

The failure of these previous analyses [12, 20] suggested that sample paths 
of the recursive estimate were, in some sense, unstable. In keeping with the 
stochastic approximation representation of the algorithm, we considered a 
stabilized version of /„, namely, 

n=im ' 

which is a weighted average of the iterates fi of the recursive algorithm. This 
technique of averaging the iterates, common in the stochastic approximation 
literature, can often improve stability properties of the algorithm, such as 
decreasing the variance of an estimate reached in finite time or increasing 
the rate of convergence; see, for example, Kushner and Yin [15] . While fmW 
performs quite poorly compared to /„ in the cases we considered, it was in 
proving fn-.w^f that Theorem 3 was discovered, opening the door to the 
consistency results presented in Section 2. 

In simulations (including others not presented here), we have observed 
that /„ converges quite rapidly to the true mixing density /. For weights of 
the form Wi = [i"- + 1)"^ for a G (0.5, 1], the convergence was typically fastest 
for a = 1. These simulation results, together with the stochastic approxima- 
tion representation [20] of the recursive algorithm and the well-known results 




Fig. 6. Plot of the absolutely continuous part of the mixing distribution (left) and corre- 
sponding mixture density estimates (right) for model (IN) in the massive data example. 
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on convergence of stochastic approximation algorithms [15], suggest the fol- 
lowing conjecture: K* = Op{w^) for some P € (0,1). While the numerical 
evidence is consistent with this conjecture, the rate of convergence remains 
an open problem. 

A drawback of the recursive algorithm is that it cannot handle an ad- 
ditional unknown parameter ^ in p{-\0), such as an unknown ^ = o"^ when 
p{-\0) is a N{6,a'^) density. Martin and Ghosh [20] tackle this problem when 
replicates Xn, . . . , Xir are available from p{-\0i,C)- The general idea is to use 
a suitable estimate = S,{Xi, . . . ,Xi) of ^, based on the first i observations, 
as a plug-in in the update i— > fi in 2. This procedure has performed well 
in a variety of simulations but convergence results are known only for the 
case of finite Q [20] . The proof of convergence in [20] is based on a stochas- 
tic approximation representation of the algorithm and, therefore, does not 
easily extend to more general 0. 

The numerical illustrations given here, as well as in [12, 20, 22], suggest 
that RE and PARE perform quite well in a variety of problems compared to 
other alternatives, such as MLE or NPB. While these alternatives are pop- 
ular and have well-known convergence properties, which provide practical 
and theoretical justification for their use in applications, they lack the com- 
putational efficiency of the recursive algorithm and often produce very poor 
estimates. Even if one insists on a more traditional analysis, RE or PARE 
could be used in a computationally inexpensive preliminary analysis [26] to 
help choose an appropriate model to be fit to the observed data. 

The theoretical results of the present paper establish the consistency prop- 
erties the recursive algorithm was lacking which, combined with its gener- 
ality, strong finite-sample performance and speedy computation, should put 
fn and fn among the front-runners of mixing density estimates. 

APPENDIX: A NEW ALGORITHM FOR NPB 
Consider the Dirichlet process mixture (DPM) model 

independently for i = 1, . . . ,n, where p{x\6) is the likelihood function, the 
density / on O is the parameter of interest, and ^(c, /o) is a Dirichlet process 
distribution with precision parameter c > and base density /q. Ferguson 
[10] shows that if the ^j's were observed, then the posterior distribution / is 
easily obtained. However, special techniques such as data augmentation [32], 
are needed when only the indirect observations available. In 

this section, we briefly outline a new approach to this problem. 

In this approach, the mixing parameters 0j's are collapsed onto only the 
clustering configuration s = (si, . . . where Sj's are defined sequentially 
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as follows: si = 1 and, for i = 2, . . . ,n 

{Sj, if there is a j <i such that 6i = 9j, 

1 + max s j , otherwise. 
j<i 

Like in Liu [19], the basic idea is to sequentially generate from 

(10) p{st\xi,si,...,xt-i,st-i,xt), t = l,...,n 
and calculate the importance weight 

n 

(11) LO = p{xi)Y[p{xt\xi,Sl, . . . , Sj„i). 

t=2 

The current method differs from that of Liu [19] in two important ways. 
First, simulation of s in 10 requires no advanced sampling techniques while 
the computational complexity of Liu's step (A) is problem-specific. Second, 
the conditional mean fi^) =E{f\x,s) of the mixing density given the data 
x and the clustering configuration s can be easily calculated: 



fis) ^ _1 



c + n 



M 



where M = maxj Sj is the total number of clusters, ne = #{j : Sj = 1} are the 
cluster sizes and /^^^ are the cluster specific "parametric" posterior mean 
densities given by 

/W(0)oc n pi^.mfoio). 

j:Sj=e 

These calculations are summarized in the following algorithm. 

1. Set M = 1, si = 1, ni = 1, 

2. For i = 2, . . . , n repeat 

(a) Set qo = cj p{xi\6) fQ{6) fi{d9) and compute 

q, = ne J p{x,\e)f'^'\9)ii{de), i = l,...,M. 

(b) Update a; <— ti;X]*=o ^^/('^ + ^ ~ -*-)• 

(c) Draw m from {0, 1, ... , M} with probabilities {po,pi,. . . ,pm) where 
Pi oc qi. 
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(d) If m = 0, then update M ■>— M + 1, set Si = M, nM = 1 and 




Otherwise, set Si = m and update rim. + 1 and 




Jp{xi\9')f("'){9')n{d9')' 



Steps 1 and 2 are repeated R times independently, producing estimates /^^''•' 
and weights ojr, for r = 1, . . . , i?. Then, based on the identity /npb = IE[/('^)], 
the posterior mean is approximated by the weighted average 




R 

r=l 



Note, finally, that permuting the observations xi,. . . ,Xn before each of these 
R iterations can greatly improve the efficiency of the algorithm. 
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